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Abstract 

Background: Ongoing improvements in throughput of the next-generation sequencing technologies challenge 
the current generation of de novo sequence assemblers. Most recent sequence assemblers are based on the 
construction of a de Bruijn graph. An alternative framework of growing interest is the assembly string graph, not 
necessitating a division of the reads into /c-mers, but requiring fast algorithms for the computation of suffix-prefix 
matches among all pairs of reads. 

Results: Here we present efficient methods for the construction of a string graph from a set of sequencing reads. 
Our approach employs suffix sorting and scanning methods to compute suffix-prefix matches. Transitive edges are 
recognized and eliminated early in the process and the graph is efficiently constructed including irreducible edges 
only. 

Conclusions: Our suffix-prefix match determination and string graph construction algorithms have been 
implemented in the software package Readjoiner. Comparison with existing string graph-based assemblers shows 
that Readjoiner is faster and more space efficient. Readjoiner is available at http://www.zbh.uni-hamburg.de/ 
readjoiner. 



Background 

The de novo sequence assembly problem is to recon- 
struct a target sequence from a set of sequence reads. 
The classical approach to de novo assembly consists of 
three phases: overlap, layout and consensus. During the 
overlap phase, suffix-prefix matches among all pairs of 
sequence reads are computed, and turned into an over- 
lap graph [1]. In the layout phase the location of the 
reads with respect to each other is determined. In the 
consensus phase the target sequence is reconstructed, by 
selecting a base for each position. 

The introduction of the massively parallel next- 
generation DNA sequencing technologies has led to a 
considerable increase in the amount of data typically gen- 
erated by sequencing experiments. For example, as of 
January 2012, the HiSeq2000 sequencer of Illumina deli- 
vers sets of 100 bp reads with a total length of up to 600 
Gbp [2]. Sequence analysis software tools developed only 
a few years ago are often unable to deal with such large 
amounts of short reads: This has led to a gap between the 
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ability to produce sequence data and the capability to as- 
semble and analyze them [3]. 

The computation of the overlap graph is the most time 
and space consuming of the three phases, and was con- 
sidered a bottleneck in the computation. Therefore, al- 
ternative methods were developed avoiding an explicit 
overlap computation. An approach which proved to be 
effective is based on the enumeration of all £-mers of 
the reads and their representation in a de Bruijn graph, 
as first proposed by [4]. This concept is applied in sev- 
eral popular short reads assemblers such as Velvet [5], 
EULER-SR [6] and Abyss [7]. 

The de Bruijn graph describing the /c-mer spectrum of 
the read set has interesting properties for the solution of 
the assembly problem, such as the collapsing of different 
instances of sequence repeats into common paths of the 
graph. However, reducing short reads into even shorter 
units compromises the ability of disambiguation of short 
repeats. Myers [8] presented an alternative framework, 
the assembly string graph. Like in the de Bruijn graph, 
repeats still collapse into common graph elements. 
There are two main advantages of the string graph, com- 
pared to the de Bruijn graph: At first, it does not require 
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to split the reads into k-mers. Secondly, a string graph 
always retains read coherence, i.e. each path in the string 
graph represents a valid assembly of the reads. 

Edena [9] was the first available implementation of a 
string graph-based assembler. It computes suffix-prefix 
matches using a suffix array [10] representing all suffixes 
of the reads. From these, the complete overlap graph is 
constructed before transitive edges are removed using 
an algorithm described in [8]. 

A more space-efficient approach to the string graph 
construction has been presented in [11] and was imple- 
mented in the open source String Graph Assembler 
(SGA) [12]. SGA computes suffix-prefix matches using 
an algorithm based on the Burrows and Wheeler trans- 
form (BWT), allowing to classify suffix-prefix matches as 
transitive or irreducible, so that the string graph can dir- 
ectly be constructed. 

Recently, a compact representation for exact-match 
overlap graphs has been described in [13], together with 
a fast construction algorithm, which has been implemen- 
ted in the string graph-based assembler LEAP. 

In this paper, we present new efficient algorithms for 
the computation of irreducible suffix-prefix matches and 
the construction of the assembly string graph. These are 
implemented in a new string graph based sequence 
assembler Readjoiner. To validate our approach, we 
compared Readjoiner with the current implementa- 
tions of Edena, LEAP and SGA. Readjoiner proved 
to be considerably faster than previous competitors, 
or uses less memory. In fact, Readjoiner is able to 
handle very large datasets using limited resources: 
for example, a short reads dataset consisting of 115 
Gbp could be assembled on a single core in 51 
hours using 52 GB RAM. 

All string graph-based assemblers aim at constructing 
the same graph: However, the algorithms and data struc- 
tures employed in Edena, LEAP, SGA and Readjoiner dif- 
fer considerably. LEAP employs a compact representation 
of the overlap graph, while Readjoiner circumvents the 
construction of the full overlap graph. Both Edena and 
SGA are based on explicit index structures (suffix array 
and FM-index, respectively) representing all suffixes of all 
reads in the read set, while Readjoiner enumerates and 
sorts only a proper subset of the suffixes of the reads, and 
efficiently inserts them into buckets, which can be pro- 
cessed independently from each other. 

Methods 

Basic definitions 

Let w be a string of length n of symbols over an al- 
phabet Z. w[i] denotes the ith symbol of w and w[i. . j] the 
substring of w from position i to /, 1 <i,j<n. w[l. . .i] is 
the prefix of w ending at position i and w\j. . .n] is 
the suffix of w starting at position /'. A substring of w 



is proper if it is different from w. A substring of w is 
internal if it is neither a prefix nor a suffix of w. 

A read r is a string over the alphabet {A, C, G, T) which 
is assumed to be sorted by the alphabetical order < such 
that A<C<G <T. < denotes the lexicographic order of 
all substrings of the reads induced by the alphabetical 
order < . Let n be the length of r. The reverse complement 
of r, denoted by r , is the sequence r[n\. . . r[l], where a 
indicates the Watson-Crick complement of base a. 

Computing suffix- and prefix-free read sets 

The first step of our approach for assembling a collec- 
tion of reads is to eliminate reads that are prefixes or 
suffixes of other reads. Here we describe a method to 
recognize these reads. Consider an ordered set TZ = 
(fi, . . . , r m ) of reads, possibly of variable length, in which 
some reads may occur more than once (so TZ is indeed a 
multiset). We assume that, for all i, l<i<m, the ith read 
J"; in TZ is virtually padded by a sentinel symbol $,• at the 
right end and that the alphabetical order < is extended 
such that A < C< G< T< $ x < $ 2 < • • • < $ m . 

We define a binary relation < on TZ such that r, < rj if 
and only if i < j. That is, < reflects the order of the reads 
in the input. TZ is prefix-free if for all reads r in TZ there is 
nor' in TZ\{r} such that r is a prefix of r' . 1Z is suffix-free 
if for all r in TZ there is no read r' in TZ\{r} such that r is 
a suffix of r' . 

To obtain a prefix- and suffix-free set of reads we lex- 
icographically sort all reads using a modified radixsort 
for strings, as described in [14]. In this algorithm, the 
strings to be sorted are first inserted into buckets 
according to their first character. Each bucket is then 
sorted recursively according to the next character of all 
reads in the bucket. A bucket always consists of reads 
which have a common prefix. Once a bucket is smaller 
than some constant, the remaining suffixes of the reads 
in the bucket are sorted by insertion sort [15]. 

During the sorting process, the length of the longest 
common prefix (lcp) of two lexicographically consecu- 
tive reads is calculated as a byproduct. For two lexico- 
graphically consecutive reads r and r' with an lcp of 
length £ = \r\, we can conclude that r is a prefix of r'. If 
£ < \r'\, then r is a proper prefix of r' and we mark r. If 
£= \r'\, then r and r' are identical and we mark the 
read which is larger according to the binary relation < . 

To handle reverse complements and to mark reads 
which are suffixes of other reads, one simply applies this 
method to the multiset TZ = (fx, . . . , r m , r m+ i, . . . , r2 m ) 
where r m+ i = fi for all i, l<i<m. As TZ includes the re- 
verse complements of the reads, the method also marks 
reads which are suffixes of other reads. This is due to 
the observation that if read r is a suffix of read r' , then r 
is a prefix of ¥ . 
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In a final step of the algorithm one eliminates all reads 
from 1Z which have been marked. The remaining un- 
marked reads from 1Z are processed further. The algo- 
rithm to compute a suffix- and prefix-free set of reads 
runs in 0{m^^j time, where A max is the maximum 
length of a read and co is the machine's word size. As we 
consider A max to be a constant (which does not imply 
that the reads are all of the same length), the algorithm 
runs in 0(m) time. 

Computing suffix-prefix matches 

Suppose that 1Z is a suffix- and prefix-free set of m 
reads. Let i min > 0 be the minimum length param- 
eter. The set SPM(1Z) of suffix-prefix matches {SPMs, 
for short) is the smallest set of triples (r, r',£)such 
that r,r'£lZ and strings u,v,w exist such that 
r=uv, r' =vw, and |v| =t>i min . £ is the length of a 
suffix-prefix match (r,r',t). The suffix-prefix match- 
ing problem is to find all suffix-prefix matches. As 
the reads of length smaller than l min cannot, by def- 
inition, contribute any SPM, we can ignore them and 
thus we assume that TZ only contains reads of length 
at least l min . 

The method to solve the suffix-prefix matching problem 
presented here consists of two main algorithms. The first 
algorithm identifies and lexicographically sorts all SPM- 
relevant suffixes, i.e. a subset of all suffixes of all reads 
from which one can compute all suffix-prefix matches. 
The second algorithm enumerates these matches given 
the sorted list of all SPM-relevant suffixes. 

Consider a suffix-prefix match (r,r',l). By definition, 
the suffix of length £ of r exactly matches the prefix of 
length £ of r' . Obviously, the suffix of r involved in the 
match starts at some position j, 2<j < \r\ — i m i n + 1 in r. 
This implies that r must be at least of length i min + 1. The 
suffix cannot start at the first position in r, as otherwise r 
would be a prefix of some other read, contradicting our 
assumption that 1Z is prefix-free. 

To enumerate the set of all suffix-prefix matches of 
length at least £ m ;„, we preprocess all reads and deter- 
mine all proper suffixes of the reads which may be 
involved in a suffix-prefix match. More precisely, for all 
reads r we determine all matching candidates, i.e. all 
proper suffixes 5 of r such that the length of s is at least 
i min and there is a read r' such that s and r' have a com- 
mon prefix of length at least k, where k is an user- 
defined parameter satisfying <:<m«{t m i»,f}. There are 
two reasons for imposing this constraint on k: First, we 
want to represent a string of length k over an alphabet 
of size 4 in one machine word, thus k < | . Second, the 
suffixes of the reads from which we take the prefixes of 
length k have minimum length £ mi „, thus we choose 



The set of all matching candidates and all reads forms 
the set of all (i m i n ,k)-SPM-relevant suffixes. For simpli- 
city sake, we use the notion SPM-relevant suffixes if i min 
and k are clear from the context. While all SPMs can be 
constructed from the SPM-relevant suffixes, not all 
SPM-relevant suffixes lead to an SPM. 

An efficient algorithm for identifying and sorting all 
SPM-relevant suffixes 

The first two phases of our algorithm follow a strategy 
that is borrowed from the counting sort algorithm [15]. 
Like this, our algorithm has a counting phase and an in- 
sertion phase. However, in our problem, the elements 
(i.e. SPM-relevant suffixes) to be sorted are only deter- 
mined during the algorithm. Moreover, the number of 
keys (i.e. initial Aimers) whose occurrences are counted 
is on the order of the number of elements to be sorted. 
Therefore, in a space efficient solution, it is not trivial to 
access a counter given a key. We have developed a time 
and space efficient method to access the counter for a 
key, exploiting the fact that counting and inserting the 
SPM-relevant suffixes does not have to be done immedi- 
ately. Instead, the items to be counted/ inserted are first 
buffered and then sorted. A linear scan then performs 
the counting or inserting step. 

In contrast to counting sort, our algorithm uses an 
extra sorting step to obtain the final order of elements 
pre-sorted in the insertion phase. Under the assumption 
that the maximum read length is a constant (which does 
not imply that the reads are all of the same length), our 
algorithm runs in 0(n) time and space, where n is the 
total length of all reads. To the best of our knowledge a 
method employing a similar strategy has not yet been 
developed for the suffix-prefix matching problem. 

We first give a description of our algorithm using string 
notation. In a separate section, we explain how to effi- 
ciently implement the algorithm. In the following, we only 
consider the reads in the forward direction. However, it is 
not difficult to extend our method to also incorporate the 
reverse complements of the reads and we comment on 
this issue at the end of the methods section. 

The initial k-mer of some sequence s is the prefix of s 
of length k. To determine the matching candidates effi- 
ciently, we first enumerate the initial /c-mers of all reads 
and store them in a table of size m. This can be done in 
0(m) time. The notion table size always refers to the 
number of entries in the table. The next step lexico- 
graphically sorts the /c-mers in the table in ascending 
order. This string sorting problem can be transformed 
into an integer sorting problem (see Implementation) 
which can be solved by radixsort [15] in 0(m) time and 
0(m) extra working space. 

In the next step, a linear scan of the sorted Ar-mers 
removes duplicates from the table and counts how many 
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times each Ar-mer occurs in the table. This scan requires 
0(m) time. Let d<m be the number of different /c-mers. 
These can be stored in a table K of size d. 

The counts for the elements in K require another table C 
of size d. In addition to the duplicate removal and counting, 
the linear scan of the sorted /c-mers constructs two sets 
P and Q, the size of which depends on two user defined 
parameters k' < k and k"< k. P is the set of all initial k'-mers 
of the reads. Q is the set of all A:-mers r\k — k" + 1 . . . k] 
for some r £lZ. We assume that elements can be added to 
P and Q in constant time and that membership in these 
sets can be decided in constant time. Thus the linear scan 
constructs P and Q in 0(m) time. As P is a subset of a set 
of size 4 k , P can be stored in 4* bits. Q requires 4 k bits. 

Up until now, only the initial k-mers of the reads were 
considered, resulting in a sorted table K of d non- 
redundant keys (i.e. initial Ar-mers of reads), a table C of 
size d for counting /c-mers and two sets P and Q. By con- 
struction, each count in C is at least 1 and the sum of 
the counts is m. The next task is to enumerate, for all reads 
r, the suffixes of r at all positions /, 2 < j < \r\ — £ ml „ + 1 . 
r has \r\ - t min such suffixes. For each such suffix s 
(which by construction is of length > £ ml „), one extracts 
two strings v = s[l. . .k'] and w = s[k — k" + 1 . . . k] . If v 
does not occur in P, then v is not a prefix of any read in 1Z 
and thus s is not a matching candidate and can be 
discarded. If w does not occur in Q, then w ^ 
r[k — k" + 1 . . . k] for all reads r £ 1Z and thus s is not a 
matching candidate and can be discarded. Thus P and Q 
serve as filters to efficiently detect suffixes which can be 
discarded. For read r the suffixes s and corresponding 
strings v and w can be enumerated in 0(\r\ - i m i n ) time. 
Checking membership in P and in Q requires 
constant time. Therefore, each read r is processed in 0(| 
A - V-min) time. Thus the enumeration and checking 
requires 0(^ r6 7j|'"| — m %-min) time altogether. 

The next task is to process a suffix, say s which has 
passed the P-filter and the Q-filter. That is, v = s[l. . .k'] £ P 
and w = s[k — k " + 1 . . . /c] £ Q holds. One now has to 
check if u = s[l. . ./c] occurs in K to verify if s is a matching 
candidate. If the latter is true, the appropriate counter 
needs to be incremented. Hence this is the counting phase 
of our algorithm. The simplest way to check the occurrence 
of u in K, is to perform a binary search, taking u as the key. 
However, this would require 0(logid) time for each /c-mer 
passing the filters. This is too slow. Using a hash table 
turned out to be too slow as well and would require too 
much extra space, which we do not want to afford. 

We propose an efficient method that works as follows: 
Store each /c-mer s[l../c] passing the P and the Q-filter in a 
buffer B of fixed size b = ^ for some constant y>l. Once 

B is full or all k-mers have been added to B, sort the 
elements in B in ascending lexicographic order. Then 
perform a binary search in K, but only for the first 



element in B, say x. As B is sorted, x is the smallest elem- 
ent. The binary search for x in K finds the smallest elem- 
ent in K greater than or equal to x using 0(log 2 d) time. If 
such an element occurs in K, say at index e, then simul- 
taneously scan B beginning with the first index and K be- 
ginning at index e. For any element in B that is equal to 
an element in K, say at index i in K, increment the coun- 
ter in C at the same index. 

This simultaneous linear scan of B and (a part of) K takes 
0(b + d) time and finds all /c-mers from B occurring in K. 
Once the scan and the associated increments are done, the 
buffer is emptied for the next round. Suppose that there are 
in total b* /c-mers that have passed B. Thus there are 1"^-] 
rounds filling the buffer. Each round is associated with a 
sorting step, a binary search and a linear scan. Sorting 
requires 0(b) time using radixsort. This gives a running 
time of 0(£ (b + log 2 d+{b + d))) = 0(^ (b + d)) = 
0(^{b + by)) = 0{b*{l+ y)) = 0(b*) . As b*<n := 
5^ r(E7 j|r|, the running time is linear in the total length of 
the reads. 

Once all reads have been processed, for any initial 
/c-mer u of any read, the following holds: If u is the iih ini- 
tial k-mer in K, then C[i] is the number of SPM-relevant 
suffixes of which u is a prefix. To prepare for the insertion 
phase, compute the partial sums of C in an array n of 
size d+l, such that ji[0] = C[0], n[i] = n[i — 1] + C[i] for 
all i, \ <i<d-\, and n[d] = n[d - 1]. n[d] is the number of 
all SPM-relevant suffixes. One creates a table S of size 
g: = n[d] to hold pairs of read numbers and read offsets. 
As in the counting phase, enumerate all suffixes of reads 
of length at least i min passing the P- and the Q-filter. Sup- 
pose that s is such a suffix of read number p and with 
read offset q. Let u be the initial Zr-mer of s. Then we store 
ip, q, u) in a buffer B ' of fixed size |. We choose this buffer 
size, as the elements in B ' require twice as much space as 
the elements in B. As in the counting phase, sort the buf- 
fer in lexicographic order of the k-mers it stores, and 
then process the buffer elements using the /c-mer, say u, 
as a key to determine if u matches some element in 
K, say at index i. Then insert (p, q) at index 7i[i] - 1 in S 
and decrement n[i]. 

After all b* elements passed the buffer and have been 
processed, 5 holds all SPM-relevant suffixes (represented 
by read numbers and read offsets) in lexicographic order 
of their initial /c-mers. Let u be the z'th k-mer in K. Then 
all SPM-relevant suffixes with common prefix u are 
stored in 5 from index jt[i] to n[i + 1] — 1. Thus S can 
uniquely be divided into buckets of SPM-relevant suf- 
fixes with the same initial /c-mer. Each such bucket can 
be sorted independently from all other buckets. More- 
over, each SPM-relevant suffix not occurring in the ith 
bucket, has an initial k-mer different from u and thus can- 
not have a common prefix of length > i min with the suf- 
fixes in the ith bucket. As a consequence, all suffix-prefix 
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matches are derivable from pairs of SPM-relevant suffixes 
occurring in the same bucket. Thus, the suffix-prefix 
matches problem can be divided into d subproblems, each 
consisting of the computation of suffix-prefix matches 
from a bucket of SPM-relevant suffixes. This problem is 
considered later. 

To sort the ith bucket one extracts the remaining suf- 
fixes relevant for sorting the bucket and stores them in a 
table. This strategy minimizes the number of slow ran- 
dom accesses to the reads. Consider the ith bucket and 
let (p, q) be one of the suffixes in the bucket, referring to 
the suffix of read r p at read offset q. Then extract the 
suffix of r p starting at position q + k. As the maximum 
read length is considered to be constant, the total size of 
the remaining suffixes to be stored is 0(n[i + 1] — Ji[i]). 
The remaining suffixes can be sorted using radixsort in 
0(n[i + 1] — Jt[i]) time. An additional linear time scan 
over the sorted suffixes of the bucket delivers a table L of 
size n[i + 1] — n[i] — 1, such that Lj is the length of the 
longest common prefix of the suffixes S[n[i] + j — 1] 
and S[n[i] + j] for all /, 1 < j < n[i + 1] - n[i] - 1. 

Sorting all remaining suffixes and computing the 
lcp-table L thus requires 0{ji max ) space and 

0 {T?i=o (^[i + !] - *M)J = 0(g) time where f3 max is 
the maximum size of a bucket and g is the total number 
of SPM-relevant suffixes. The bucket of sorted SPM- 
relevant suffixes and the corresponding table L are pro- 
cessed by Algorithm 2 described after the following imple- 
mentation section and Algorithm 3 described in 
Additional file 1, Section 7. 

All in all, our algorithm runs in 0(m + n+g) = 0(n) time 
and 0(m + 4? k + 4? k + f> max +g + n) space. As we choose 
k" < k' € 0{log 2 n) and m, g, and P max are all smaller than 
n, the space requirement is 0(n). Thus the algorithm 
for identifying and sorting all (£„,,„, k) -SPM-relevant 
suffixes is optimal. 

implementation 

We will now describe how to efficiently implement the 
algorithm described above. An essential technique used 
in our algorithm are integer codes for /r-mers. These are 
widely used in sequence processing. As we have three 
different mer-sizes (k, k', and /<") and dependencies be- 
tween the corresponding integer codes, we shortly de- 
scribe the technique here. In our problem, a /c-mer 
always refers to a sequence of which it is a prefix. There- 
fore, we introduce integer codes for strings of length > k: 
For all strings s of length at least k define the integer 

code ifik(s) = YM=i^ k ~ l v( s [i])> where <p is the mapping 
[A — ► 0, C— ► 1, G— ► 2, T— ► 3] uniquely assigning numbers 
from 0 to 3 to the bases in the alphabetical order of 
the bases. Note that only the first k symbols of 5 de- 
termine (pic(s), which is an integer in the range 



[0. . A k - 1]. For all strings s and s' of length at least k, 
s<s' implies ipk(s) < ip k (s'), where < denotes the lex- 
icographic order of strings and < denotes the order 
of integers. 

Besides ip k , we use the encodings ip k - and ip k - for 
some k',k"<k. (fk' encodes the prefix of s of length k' 
and is defined in analogy to <f k (replacing k by k'). 
encodes the suffix s[k — k" + 1 . . . k] of s[l. . .k] of 
length /<", i.e. p*„(s) = Z£i4*"~V(*[* - k" + i}). <^(s) 
and ¥>|»(s) can be computed from <p k (s) according to the 
following equations: 



<pt.(s)=(p k (s)mod4 k " (2) 

We implement A:-mers by their integer codes. Each in- 
teger code can be computed in constant time by extract- 
ing the appropriate sequence of consecutive bit pairs 
from a 2bit per base encoding of the read set. In our im- 
plementation, we use the representation and the appro- 
priate access functions from the GtEncseq software 
library [16]. As /c<| we can store each integer code in 
an integer of the machine's word size. We sort m integer 
codes for the initial /c-mers using quicksort, adapting the 
code from [17]. Our implementation works without re- 
cursion and uses an extra stack of size 0(log 2 m) to sort m 
integers. This small additional space requirement is the 
main reason to choose quicksort instead of radixsort, which 
is usually more than twice as fast, but requires 0{m) extra 
working space, which we do not want to afford. 

The sets P and Q are implemented by bit vectors v P and 
Vq of 4* and 4^ bits, respectively. Bit v P [q] is set if and 
only if q = ^ (r) for some r g TZ. Bit v Q [q] is set if and only 
if q = (p k k - (r) for some read r £ TZ . To obtain the bit 
index, one computes ^(s) and (p k k „(s) from f k (s) using 
Equations (1) and (2). Equation (1) can be implemented 
by a bitwise right shift of 2{k- k') bits. Equation (2) can be 
implemented by a bitwise and operation with the integer 
2 2k - 1. Thus, given the integer code for s, both ^ (s) 
and ipH'is) can be computed in constant time. Therefore, 
the sets P and Q can be constructed in 0(m) time and 
each access takes constant time. 

When determining the /c-mer codes in the counting 
phase and in the insertion phase, we sweep a window 
of width k over the sequence reads and compute the 
integer code for the sequence in the current window 
in constant time. 

We implement the counts by a byte array of size d 
and store counts larger than 255 in an additional hash 
table. Additional file 1, Section 1 gives the details. 

The partial sums in table n are bounded by g, the 
number of SPM-relevant suffixes. For large read sets, 
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g can be larger than 2 - 1. However, as the partial 
sum are strictly increasing, one can implement n by a 32 
bit integer table PS of size d+1, such that PS[i] = n[i] 
mod 2 32 for any i, 0<i<d and an additional integer table 
of size 2 max ^ l0 ^~ 32 ^ marking the boundaries of carry 
bits. Details are given in Additional file 1, Section 2. 

For the insertion phase we need a representation of 
the read set (2n bits), table K (2kd bits), set P and Q 
(4*' and 4*" bits, respectively), table n (32(rf + 1) bits) 
and table 5 of size g. As S holds pairs of read num- 
bers and read offsets, each entry in S is stored compactly 
in o= \log%m\ + \log 2 {\max- tmi„)] bits. This would 
give an integer array of size \^f \ if we would store 5 com- 
pletely. But we do not, as we employ a partitioning strategy, 
explained next. 

Although the data structures representing tables S, K, P 
and Ji are of different sizes, their access follows the same 
scheme: Suppose that i is the smallest index, such 
that | < 7r[i] . Roughly half of the suffixes to be inserted 
in S are placed in buckets of lower order (with index < i) 
and the other half are placed in buckets of higher order 
(with index > i). The buckets of lower order are associated 
with the /c-mers in K up to index i. Hence, for these, one 
needs table K and PS only up to index i. Let 5 be some suf- 
fix of length <l m i n such that <^*(s) <K[i]. To apply the 

P- filter to 5, one checks v P at index <4r, which is 

in the first half of vector v P . This strategy, dividing 
tables 5, K, P and n into q = 2 parts of roughly the same 
size, can be generalized to q > 2 parts. Each part is defined 
by a lower and an upper integer code and by correspond- 
ing lower and upper boundaries referring to sections of 
the four mentioned tables. Partitioning S means to only al- 
locate the maximum space for holding all buckets belong- 
ing to a single part. 

The four tables that can be split over q parts require 
h(g,k,d,k",0) = 2kd + 4 k " + 32(d+l)+g0 bits. Hence, 
in the insertion phase, our method requires In + 
4? k + bits, w here In + 4? k bits are for the repre- 

sentation of the reads and the set Q (which cannot 
be split). As go dominates all other terms, h(g,k,d,k" ,0) 
is much larger than 2n + 4? k so that the space gain of 
our partitioning strategy is obvious. As the space 
required for the insertion phase for any number of 
parts can be precalculated, one can choose a memory 
limit and calculate the minimal number of parts such 
that the limit is not exceeded. In particular, choosing 
the space peak of the counting phase as a memory 
limit for the insertion phase allows for balancing the 
space requirement of both phases. More details on 
the partitioning technique are given in Additional file 1, 
Section 3. 

An obvious disadvantage of the partitioning strat- 
egy (with, say q, parts) is the requirement of q scans 



over the read set. However, the sequential scan over 
the read set is very fast in practice and only makes 
up for a small part of the running time of the inser- 
tion phase. 

The expected size of a bucket to be sorted after the in- 
sertion phase is smaller than the average read length. 
The maximum bucket size (determining the space re- 
quirement for this phase) is 1 to 2 orders of magnitude 
smaller than d. As we can store f bases in one integer of 
to bits, the remaining suffixes (which form the sort keys) 
can be stored in fi max + 2) integers, where ji max is 

the maximum size of a bucket and A max is the maximum 
length of a read. The additional constant 2 is for the 
length of the remaining suffix, for the read number and 
the read offset. The sort keys are thus sequences of inte- 
gers of different length which have to be compared 
up to the longest prefix of the strings they encode. 
We use quicksort in which | bases are compared 
using a single integer comparison. As a side effect of 
the comparison of the suffixes, we obtain the longest 
common prefix of two compared suffixes in constant 
extra time, and store this in a table L of the size of 
the bucket. The suffixes in the bucket and the table L are 
passed to Algorithm 2, described next, and to Algorithm 3 
(Additional file 1, Section 7). 

An efficient algorithm for computing suffix-prefix matches 
from buckets of sorted SPM-relevant suffixes 

The input to the algorithm described next is a bucket of 
sorted SPM-relevant suffixes, with the corresponding table 
L, as computed by the algorithm of the previous subsec- 
tion. Consider the ith bucket in S and let Hj = S[n[i] + j] 
be the jth suffix in this bucket for all j, 0 <j < fi - 1 
where /? = n[i + 1] - is the size of the bucket. By 
construction, we have Hj_\ < Hj, Lj > k, and L ; is the 
length of the longest common prefix of Hj_i and Hi 
for /, 1 </</?- 1. 

Note that the bucket-wise computation does not de- 
liver the lcp-values of pairs of SPM-relevant suffixes on 
the boundary of the buckets. That is, for all *>0, the 
length of the longest common prefix of S[jr[i] - 1] and 
S[7r[z']] is not computed, because S[7r[i] - 1] is the last 
suffix of the (i'-l)th bucket and S[7r[/]] is the first 
suffix of the ith bucket. However, as both suffixes be- 
long to two different buckets, their longest common 
prefix is smaller than k (and thus smaller than £ ml „) 
and therefore not of interest for the suffix-prefix 
matching problem. 

The suffixes occurring in a bucket will be pro- 
cessed in nested intervals, called lcp-intervals, a no- 
tion introduced for enhanced suffix arrays by [18]. 
We generalize this notion to table H and L as 
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follows: An interval [e..f\, Q<e<f<ji-\, is an Icp- 
interval of lep-value £ if the following holds: 

• e = 0 orL e < £, 

• L q > £ for all q, e + 1 < q <f, 

• L q = £ for at least one q, e + 1 < q <f, 

• / = P - 1 orI/+i < i. 

We will also use the notation £ - [e.f\ for an lcp-interval 
[e.fl of lcp-value £. If £ - [e..f\ is an lcp-interval such that 
w = H e [l. . .£] is the longest common prefix of the suffixes 
H e ,H e+1 , . . .,Hp then [e..f\ is called the w-interval. 

An lcp-interval £' -[e'.f] is said to be embedded in 
an lcp-interval £ - [e..f\ if it is a proper subinterval of 
l-[e..f\ (i.e.,e<e' <f <f) and £'>£. The lcp-interval 
£ - [e.fl is then said to be enclosing [e'.f ]. If [e.f] encloses 
[e'.f] and there is no interval embedded in [e.f] that 
also encloses [e'.f], then [e'.f] is called a child inter- 
val of [e.f] and [e.f] is the parent interval of [e'.f]. 
We distinguish lcp-intervals from singleton intervals 
[e'] for any e ', 0<e',</3-l. [e] represents H e -. The 
parent interval of [e ] is the smallest lcp-interval [e.f] 
with e < e ' <f. 

This parent-child relationship of lcp-intervals with 
other lcp-intervals and singleton intervals constitutes a 
virtual tree which we call the lcp-interval tree for H and L. 
The root of this tree is the lcp-interval 0- [0..(/?- 1)]. The 
implicit edges to lcp-intervals are called branch-edges. 
The implicit edges to singleton-intervals are called leaf- 
edges. Additional file 1, Section 10 gives a comprehensive 
example illustrating these notions. 

Abouelhoda et al. ([18], Algorithm 4.4) present a linear 
time algorithm to compute the implicit branch-edges of 
the lcp-interval tree in bottom-up order. "When applied to 
a bucket of sorted suffixes, the algorithm performs a linear 
scan of tables H and L. In the eth iteration, 0 < e < /? - 2, it 
accesses the value L e+1 and H e . We have non-trivially 
extended the algorithm to additionally deliver leaf edges. 
The pseudocode, with some additions in the lines marked 
as new, is given in Algorithm 1 (Figure 1). We use the fol- 
lowing notation and operations: 

• A stack stores triples (£, e,f) representing an lcp- 
interval £ - [e..f\ . To access the elements of such a 
triple, say sv, we use the notation sv.lcp (for the lcp- 
value £), sv.lb (for the left boundary e) and sv.rb (for 
the right boundary f). 

• stack.push{e) pushes an element e onto the stack. 

• stack.pop pops an element from the stack and 
returns it. 

• stack.top returns a reference to the topmost element 
of the stack. 

• ± stands for an undefined value. 



• process Jeafedgeifirstedge, itv, (p, q)) processes an 
edge from the lcp-interval itv to the singleton 
interval representing the suffix r p [q. . .\r p \]. firstedge 
is true if and only if the edge is the first processed 
edge outgoing from itv. 

• process _branchedge(firstedge, itv, itv) processes an 
edge from the lcp-interval itv to the lcp-interval itv'. 
The value itv'.rb is defined and firstedge is true if and 
only if the edge is the first edge outgoing from itv. 

• process_lcpinterval{itv) processes the lcp-interval itv. 
The value itv.rb is defined. 

Depending on the application, we use different functions 
process Jeafedge, process_branchedge, and process Jcpinterval. 

Additional file 1, Section 4, explains why Algorithm 1 
also delivers the leaf edges of the lcp-interval tree in the 
correct bottom-up order. 

Consider a path in the lcp-interval tree from the root 
to a singleton interval [e'] representing H e - = r p [q. . .\r p \]. 
Let £ - [e.f] be an lcp-interval on this path, and consider 
the edge on this path outgoing from £ - [e.f]. If the edge 
goes to an lcp-interval of, say lcp-value £ ' , then the edge 
is implicitly labeled by the non-empty sequence 
r p [q + £. . . q + £' — 1] . Suppose the edge goes to a 
singleton interval: Then the edge is implicitly labeled by 
the non-empty sequence r p [q + £. . .1^1 - 1]$^. If q + i = \r p \, 
then r p [q + i. . .\r p \ - 1] is the empty string, which implies 
that the edge to the singleton interval is labeled by the sen- 
tinel $ p . Such an edge is a terminal edge for r p . If the read 
offset q is 0, we call [e'] a whole-read interval for r p , and 
the path in the lcp-interval tree from the root to [e ] a 
whole-read path for r p . 

Consider a suffix-prefix match ( r p , r., £ ) , such that the 
suffix w of r p of length £ has a prefix u of length k. Recall 
that u is the common prefix of all suffixes in the ith 
bucket. Due to the implicit padding of reads at their end, 
the symbol following w as a suffix of r p is % p . By defin- 
ition, w is also prefix of r, and the symbol in r, following 
this occurrence of w is different from $ p . Thus, there is a 
w-interval [e.f] in the lcp-interval tree for H and L. [e.f] 
is on the path from the root-interval to the whole-read 
leaf interval for r<. Moreover, there is a terminal edge for 
r p outgoing from [e.f]. Vice versa, an lcp-interval of lcp- 
value £ on the path to the whole-read leaf interval for r ; 
and with a terminal edge for r p identifies the suffix- 
prefix match ( r p , r p I ) . This observation about suffix- 
prefix matches is exploited in Algorithm 2 (Figure 2) 
which performs a bottom-up traversal of the lcp-interval 
tree for H and L, collecting whole-read leaves and ter- 
minal edges for lcp-intervals of lcp-value at least t min . 
More precisely, whenever a whole-read leaf for r p , l<p< 
m, is found (line 9), p is appended to the list W. With 
each lcp-interval itv on the stack used in the bottom-up 
traversal, an integer itv firstinW is associated. The 
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Input: tableLand sorted array// of SPM-relevant suffixes \5 

functions process- leaf edge, process -branchedge and process-lcpintcrval 
to process the enumerated items 

Output: Processed implicit branch- and leaf-edges of the lep-interval tree 
in bottom-up order 

1: stack '■=[} o empty stack 



2: lastitv := _L 

3: firstedgefromroot := true > No edge from root yet 

4: stack. push(0, 0, _L) > Push root-interval with undefined 7'6-value 
5: for e := 0 to 0 - 2 do 

6: if L e+ i < stack. top. lep then > new 

7: if stack. top. lep > 0 then t> new 

8: firstedge := false > new 

9: else > new 

10: firstedge :— firstedgefromroot t> new 

11: firstedgefromroot := false > new 

12: process-leafedge(firstedgc, stack. top. H e ) > new 

13: while Lp + \ < stack .top . lep do 

14: lastitv — stack.pop 

15: laslilv.rb := e 

16: process-lepinterval (lastitv) 

17: if L e+ i < stack. top. lep then 

18; if stack. top. lep > 0 then > new 

19: firstedge := false > new 

20: else D> new 

21: firstedge :— firstedgefromroot [> new 

22: firstedgefromroot := false t> new 

23: process_branchedge(firstedge, stack. top, lastitv) 

24: lastitv := _L 

25: if -Lp+i > stack. top. lep then 

26: if lastitv ^ _L then 

27: stack. push(L e+ i, lastitv. lb, _L) 

28: process-branchedge(true, stack.top, lastitv) 

29: lastitv := _L 

30: else 

31: stack. push(L e+ i, e, ±) 

32: procesS-leafedge(true, stack.top, H e ) t> new 



Figure 1 Algorithm 1. Bottom-up traversal algorithm for arrays of SPM-relevant suffixes. This is an extension of [18, Algorithm 4.4] with the 
additional lines marked as new. 



elements in W[itv.firstin W. . .\W\] are exactly the read 
numbers of whole-read leaves collected for itv. The 
value of itv.firstin W is set whenever the first edge out- 
going from itv is detected: If the first edge outgoing 
from itv is a leaf-edge, no previous whole-read leaf for 
itv has been processed: Thus | W\ + 1 is the first index 
in list W where the whole read leaf information (if 
any) for itv will be stored (see line 8). If the first edge 
is a branch-edge to lep-interval itv', then the corre- 
sponding subset of W for itv' must be inherited to 
itv. Technically, this is achieved by inheriting the 
firstinW- attribute from itv' to itv, see line 18 of Algo- 
rithm 2. 

Whenever a terminal edge for read r p , outgoing from an 
interval itv is processed (line 11), p is added to the list T. 
Suppose that this terminal edge is outgoing from the lep- 
interval itv. The first symbol of the label of the terminal 
edge is % p . Suppose there is a branch-edge outgoing from 
itv to some lep-interval itv' . Then the first symbol, say a, 



of the implicit label of this edge must occur more than 
once. Thus it cannot be a sentinel, as these are consid- 
ered different in the lexicographic ordering of the suf- 
fixes. Hence the first symbol a is either A, C, G or T. As 
these symbols are, with respect to the lexicographic order, 
smaller than the sentinels, the branch-edge from itv to itv' 
appears before the terminal edge from itv. Hence the ter- 
minal edges outgoing from itv' have been processed in line 
25, and so we only need a single list T for the entire 
algorithm. 

As soon as all edges outgoing from itv have been pro- 
cessed, we have collected the terminal edges in T and 
the whole-read leaves in W. If itv.lcp exceeds the mini- 
mum length, Algorithm 2 computes the cartesian prod- 
uct of T with the appropriate subset of W and processes 
the corresponding suffix-prefix matches of length itv.lcp 
in line 25. At this point suffix-prefix matches may be 
output or post-processed to check for additional con- 
straints, such as transitivity. Once the cartesian product 
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Input: table L and sorted array H of SPM-relevant suffixes 
with common prefix u of length k 
generic function process to postprocess an SPM 
Output: All suffix-prefix matches (r. s, i) 

such that I > l m i n and u is a prefix of s 
1: T := [ ] > empty list 

2: W := [ ] > empty list 

3: with each lcp-interval associate an integer itv.firstinW 
4: run Algorithm 1 with the following functions: 

5: function process_leafedge{firstedge . itv. (p,q)) > p is read number and q is offset 

6: if itv.lcp > i m i n then 

7: if firstedge then 

8: itv.firstinW := \W\ + 1 

9: if q — 0 then > (p, q) refers to whole read 

10: append p to W 

11: if q + itv.lcp = r p then 

12: append p to T 

13: else 
14: W := [ ] 

15: function process^branchedge(firsledge,itv,itv / ) 

16: if itv.lcp > i m i n then 

17: if firstedge then 

18; itv.firstinW :— itv' .firstinW 

19: else 

20: W := [ ] 

21: function proccss_lcpinterval(itv) 
22: if itv.lcp > ^„^ n then 
23: for all p £ T do 

24: for all j £ W[itv. firstinW . . . \W\\ do 

25: process (r p ,rj, itv.lcp) > call generic function to process SPMs 

26: T := [ 1 



Figure 2 Algorithm 2. Bottom-up traversal of lcp-interval tree enumerating suffix-prefix matches. 



has been computed, the elements from T are no longer 
needed and T is emptied (line 26). Note that the algo- 
rithm empties W once an lcp-interval of lep-value smal- 
ler than i min is processed. After this event, there will 
only be terminal edges from v-intervals such that the 
longest common prefix of v and the reads in W is smal- 
ler than t min . Therefore there will be no suffix-prefix 
match of the form (_, w, I) such that I > l mi „ and w is a 
read represented in W. So the list can safely be emptied. 

The lcp-interval tree for H and L contains ji leaf-edges. 
As all lcp-intervals have at least two children, there are at 
most P - 1 branch-edges and ji lcp-intervals. As each of 
the three functions specified in Algorithm 2 is called once 
for every corresponding item, the number of functions 
calls is at most 3/? - 1. Recall that Algorithm 2 is applied to 
each bucket and the total size of all buckets is g. 
Hence there are at most 3g - d calls to the three func- 
tions, process _leaj f edge and process _branchedge run in con- 
stant time. The running time of process _lcpinterval is 
determined by the number of SPMs processed. Assuming 
that the processing takes constant time, the overall run- 
ning time of Algorithm 2 for all buckets is 0(g + z) where 
z is the number of processed SPMs. 



Handling reverse complements of reads 

Reads may originate from both strands of a DNA mol- 
ecule. For this reason, suffix-prefix matches shall also be 
computed between reads and reverse complements of 
other reads. Handling the reverse complements of all 
reads is conceptually easy to integrate into our approach: 
One just has to process 1Z instead of 7?.. 

The three steps which involve scanning the reads 
are extended to process both strands of all reads. 
This does not require doubling the size of the read 
set representation, as all information for the reverse 
complemented reads can efficiently be extracted from 
the forward reads. Additional file 1, Section 5, shows 
how to compute the integer codes for the reversed 
reads from the integer codes of the forward reads in 
constant time. 

The scan of the reverse complemented reads has a 
negligible impact on the runtime. Of course, the size of 
the table S, K and PS roughly doubles when additionally 
considering reverse complements. 

When computing suffix-prefix matches some minor 
modifications are necessary: Applying Algorithm 2 to TZ 
finds all SPMs, including some redundant ones, which 
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we want to omit. This is formalized as follows: an SPM 
(r,s,Z) £ SPM(TZ) is non-redundant if and only if one 
of the following conditions is true: 

• re1Z,s£lZ 

• r e lZ,s e lZ 7 r <s 

• r £ 7Z,s & 7Z,s <T ■ 

For any SPM, these conditions can easily be checked in 
constant time, see Algorithm 3 (Additional file 1, Section 7). 

Recognition of transitive and irreducible 
suffix-prefix matches 

For the construction of the string graph, we do not 
need transitive SPMs. An SPM (r,t,Z") is transitive if 
and only if there are two SPMs (r,s,Z) and (s, t, Z') 
such that Z + Z' = \s\+Z". Figure 3 shows a concrete 
example of a transitive SPM. An SPM which is not 
transitive is irreducible. 

The following theorem characterizes an SPM by a read 
and a single irreducible SPM satisfying a length constraint 
and a match constraint, see Figure 4 for an illustration. 

Theorem 1. Let (r, t, £") be an SPM. Then (r, t, Z") is 
transitive if and only if there is an s € TZ and an irredu- 
cible SPM {s,t,Z') such that £'>£", \r\ -I" > \s\ -Z' 
and s[l...\s\-Z']=r[\r\-Z" -(\s\-Z') + l...\r\-Z"]. 

The proof of Theorem 1 can be found in Additional 
file 1, Section 6. 

If the SPM (r,t,Z") is transitive and (s,t,Z') is the 
SPM satisfying the conditions of Theorem 1, then we say 
that (r,t,Z") is derived from (s,t,Z'). 

Theorem 1 suggests a way to decide the transitivity of an 
SPM (r,t,Z): Check if there is an irreducible SPM (s,t,Z') 
from which it is derived. The check involves comparison of 
up to \s\ -£' symbols to verify if s[l. . .\s\ - £'] is a suffix 
of r[\ . . \r\ — £"]. As there may be several irreducible 
SPMs from which (r,t,Z") may be derived, it is neces- 
sary to store the corresponding left contexts: For any 
sequence s and any £', 1 < £' < \s\, the left context LC{s,Z') 
of s is the non-empty string s[l. . .\s\ - £']. 

Due to the bottom-up nature of the traversal in 
Algorithm 2, the SPMs involving the different prefixes of a 
given read are enumerated in order of match length, from 
the longest to the shortest one. Thus, Algorithm 2 first 

r ttacgtacgtagctagctagct 

s acgtagctagctagcttactag 

t ctagctagcttactagtcagct 

Figure 3 Example of a transitive suffix-prefix match. An example 
of a transitive SPM. A set of three reads with a transitive SPM 
(r, f, 10) derived from (s,f, 16). 



delivers the irreducible SPM (s,t,Z') from which (r,t,Z") 
is possibly derived, because £ ' > £ " . 

From Theorem 1 one can conclude that the first SPM, 
say (s,t,Z'}, found on a whole-read path for t is always irre- 
ducible. Hence, one stores LC(s, £'). An SPM (r,t,Z") 
detected later while traversing the same whole-read path 
for t is classified as transitive if and only if LC(s, £') is a suf- 
fix of LC(r,Z") (see Figure 5 for an illustration). If (r,t,Z") 
is transitive it is discarded. Otherwise, LC(r,Z") must be 
stored as well to check the transitivity of the SPMs found 
later for the same whole-read path. So each SPM is either 
classified as transitive, or irreducible, in which case a left 
context is stored. To implement this method, we use a dic- 
tionary D of left contexts, with an operation LCsearch 
{D,s), which returns true if there is some t&D such that t 
is a suffix of s. Otherwise, it adds s to D and returns false. 
Such a dictionary can, for example, be implemented by a 
trie [19] storing the left contexts in reverse order. In our 
implementation we use a blind- trie [20]. In Additional file 
1, Section 7 we present a modification of Algorithm 2 to 
output non-redundant irreducible SPMs only. 

Recognition of internally contained reads 

At the beginning of the methods section we have shown 
how to detect reads which are prefixes or suffixes of other 
reads. When constructing the string graph we also have to 
discard internally contained reads, which are contained in 
other reads without being a suffix or a prefix. More pre- 
cisely, r E TZ is internally contained, if a read r' £ TZ exists, 
such that r' = urw for some non-empty strings u and v. In 
Additional file 1, Section 8, we show how to efficiently de- 
tect internally contained reads. 

Construction of the assembly string graph 

Consider a read set TZ which is suffix- and prefix-free. The 
assembly string graph [8] is a graph of the relationships 
between the reads, constructed from SPM nr (TZ ) , the set 
of all non-redundant irreducible SPMs from SPM (TZ) 
restricted to reads which are not internally contained. 

For each r 6 TZ the graph contains two vertices 
denoted by r.B and r.E, representing the two extremities 
of the read. B stands for begin, E stands for end. 

For each non-redundant irreducible SPM 
(r,s,Z) eSPM nr (TZ) satisfying Z>i min , the graph con- 
tains two directed edges, defined as follows: 

1. if (r,s,Z) e SPM nr (TZ) there are two edges: 

• r.E^s.E with edge label s[l + 1. . .|s|] 

• s.B — » r.B with edge label r[£+l . . . |r|] 

2. if (r,s,Z) e SPM nr {TZ) there are two edges: 

• r.E^> s.B with edge label s[£+l . . . \s\] 

• s.E— > r.B with edge label r[£+l . . . \r\] 
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Figure 4 Illustration of transitivity of a suffix-prefix match. Schematic illustration of transitivity, (r, t,i") is a transitive SPM derived from the 
SPM (s, r, £'). Hence, the prefix v of s is a suffix of r[1 . . .|r|-£"]. 



3. if (5 , r,l) £ SPM m (K) there are two edges: 

• r.B — > s.£ with edge label s[£ + 1. . . |s|] 

• s.B — > r.E with edge label r[£ + 1 . . . | r\ ] 

In our implementation of the string graph, vertices are 
represented by integers from 0 to 2m - 1. To construct 
the graph from the list of non-redundant irreducible 
SPMs, we first calculate the outdegree of each vertex. 
From the counts we calculate partial sums. In a second 
scan over the list of SPMs, we insert the edges in an 
array of size 2p, where p = \SPM m (lZ) | . This strategy 
allows to allocate exactly the necessary space for the 
edges and to access the first edge outgoing from a 
vertex in constant time. The array of edges is stored 
compactly using 2p{\log 2 {2m)~\ + \log 2 {\ m!a - t m in)~\) 
bits, where A max is the maximum length of a read. 
\logi(2m)\ bits are used for the destination of an 
edge (the source of the edge is clear from the array 
index where the edge is stored). \log 2 (X mm — t m i„)~\ 
bits are used for the length of the edge label. 

To output the contigs, we first write references (such 
as read numbers and edge lengths) to a temporary file. 
Once this is completed, the memory for the string graph 
is deallocated, and the read sequences are mapped into 
memory. Finally, the sequences of the contigs are 
derived from the references and the contigs are output. 

To verify the correctness of our string graph implemen- 
tation and to allow comparison with other tools, we have 
implemented the graph cleaning algorithms described in 



[9] as an experimental feature. More sophisticated techni- 
ques, such as the network flow approach described in [8], 
are left for future work, as the main focus of this paper lies 
in the efficient computation of the irreducible SPMs and 
the construction of the string graph. 

Results 

The presented methods for constructing the string 
graph and the subsequent computation of contigs have 
been implemented in a sequence assembler named 
Readjoiner, which is part of the GenomeTools software 
suite [21]. The Readjoiner pipeline consists of three 
steps: 

• Readjoiner prefilter takes a read set in form of one 
or more Fasta-formatted files and removes reads 
containing ambiguity codes and reads which are 
prefixes or suffixes of other reads. It outputs the 
prefiltered reads in the Gt£«aseg , -format [16]. 

• Readjoiner overlap maps the representation of 
prefiltered reads in memory, enumerates non- 
redundant irreducible suffix-prefix matches, and 
stores them on file. The time/space tradeoff for this 
step can be adjusted by an option specifying the 
number of parts in which the sorted array of 
SPM-relevant suffixes is computed. Alternatively, 
one can specify a memory limit, according to which 
the minimum number of parts is determined to not 
exceed this limit. 
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Figure 5 Transitivity and left contexts. Transitivity and left contexts. Let the SPM (f, r, \z\) be derived from (5, r, \zz'\). Hence the left context 
LC(s, |z/|) = xy is a suffix of the left context LC[t, |z|) = wxy. Let ((J,r,|z|) be an irreducible SPM. Then LC{u, \yz\) = vy for some non empty string v and 
LC(s, \zz"\) is not a suffix of vy. 
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• Readjoiner assembly builds the string graph and 
traverses it to output the contigs. 

Experimental setup 

For our benchmarks, the 64bit GenomeTools binary was 
compiled by gcc version 4.3.2 using the provided Makefile 
with the option "64bit = yes assert = no amalgamation = yes". 
These last two options trigger the build-system to not 
compile assertions and to generate a single C-code file 
from which an amalgamation object is compiled, thus 
allowing for a maximum of inlined code, which in turn is 
executed faster. 

All tests were performed on a computer with a 2.40 Ghz 
Intel Xeon E5620 4-core processor, 64 GB RAM, under a 
64bit Linux operating system, using a single core only. 

For memory usage measurements, we monitored the 
VmHWM ("high water mark") value in the /proc file 
system [22] associated with the process of each particu- 
lar program over the time of its execution, including 
both allocated heap memory and memory made avail- 
able via the mmap() system call. The running time is the 
CPU time (sum of user and system time) as measured 
using GNU time. 

For all runs of Readjoiner we used /c=32, k' = max{8, 
\login \ — 8} and k" = k' - 1. If not stated otherwise, the 
number of parts in which Readjoiner computes the sorted 
array of SPM-relevant suffixes was 7. All programs were 
run with l min = 45 (which is the default minimum match 
length in SGA). 

Human genome sequencing simulations 

We tested our assembler on simulated error-free sequen- 
cing read sets based on human genomic sequences (latest 
available release of GRCh37). For each human chromo- 
some we prepared a template sequence by deleting ambi- 
guity symbols. Then we simulated reads by pseudo 
random sampling of the template sequence and its reverse 
complement, until the desired number of reads was 
obtained. This was done using the GenomeTools sim- 
reads-tool and is the same procedure as used in [11,13]. 

From each of the 24 human chromosome sequences, 
we generated a separate read set with 20 x coverage and 
a constant read length of 100 bp. The read set are called 
cl, c2, . . ., c22, cX, cY. Furthermore, using the entire 
human genome as template we generated read sets with 
20x, 30x and 40x coverage, referred to by hg20x, hg30x 
and hg40x, respectively. Additionally, from chromo- 
some 22, a set of reads of variable length was pre- 
pared: The results for this dataset are reported in 
Additional file 1, Section 9. 

In a first computational experiment, we determined the 
time vs. space tradeoff of our partitioning strategy, by ap- 
plying Readjoiner to c2 with a varying number of parts 
ranging from 1 to 9. The results are shown in Figure 6. 



The complete Readjoiner-pipeline was applied to 
each of the 24 datasets cl, c2, c3, . . ., c22, cX, cY. We 
considered the running time as a function of the 
length of each chromosome from which the dataset 
was generated and performed a linear regression, 
which delivered an 7? 2 -value of 0.997. The same was 
done for the space peak, delivering an R -value of 
0.998. Figure 7 shows plots of the time vs. length and 
space peak vs. length functions. 

Comparison with other string graph-based assemblers 

The 64bit Linux binaries of Edena [9] were downloaded 
from [23]. We tested version 2.1.1 and version 3 
devl 10920. Edena 3 is an untested version and under active 
development. In our tests, it required slightly more memory 
and was significantly slower than Edena version 2.1.1, 
which we therefore selected for the comparative test. The 
source code of SGA (version 0.9.13) was obtained from its 
public GitHub repository [24]. The 64-bit Linux binary of 
LEAP was downloaded from [25]. 

As Edena is based on the original string graph con- 
struction method proposed by [8], a comparison to 
Readjoiner allows validating our construction method. 
Table 1 reports the performance of Edena and Readjoiner 
for the assembly of datasets c22, cl5 and c7. These and 
additionally c2 were used as benchmark sets in [11,13]. 
For all three datasets, Edena and Readjoiner produce 
exactly the same list of irreducible SPMs, which allows 
concluding that the string graphs are identical. The 
resulting contig sets are almost identical: Edena was 
slightly more stringent in the output of the smallest con- 
tigs. Readjoiner was 13 - 14x faster than Edena and used 
about 11% of the space used by Edena. Due to a segmenta- 
tion fault, Edena did not complete the overlap phase for 
the largest chromosome dataset c2. 

SGA [12] is based on the direct string graph construc- 
tion methods first introduced in [11]. Currently, to the 
best of our knowledge, it is the only other open source 
string graph-based assembler, besides Readjoiner. The 
SGA default pipeline consists of the index, overlap and 
assemble tools: However, memory can be reduced by 
using sga index, rmdup and fm-merge [12]. The param- 
eter d of the SGA-index phase allows selecting the num- 
ber of sequences to be processed at a time: without this 
parameter, the index phase requires much more memory 
than other phases. With memory being the most limiting 
factor in the assembly, we optimized the space peak of 
SGA by gradually reducing the value of d until either the 
space peak of the index phase was less than the space 
peak of the other phases, or a further decrease of d did 
not reduce the space peak. SGAs index construction and 
overlap calculation can be threaded. However, for a fair 
comparison we used a single thread. Table 2 reports the 
results of running SGA and Readjoiner for the datasets 
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Figure 6 Influence of the partitioning technique on space and time requirement. Influence of the partitioning technique on space and 
time requirement. Running time and space peak of Readjoiner for the index construction of the c2 dataset with a varying number of parts 
(from 1 to 9, £ m/ „=45). 




Figure 7 Running time and space peak for Readjoiner for all 24 read sets derived from human chromosomes. Running time (A) and 
space peak (B) of Readjoiner for all 24 read sets c1,c2, . . ., c22,cX, cY derived from the human chromosomes (E m(n = 45). Each dot represents a 
human chromosome placed on the X-axes according to its length and on the Y-axes according to the running time (A) and the space peak (B) 
required by Readjoiner to process it. The line was fitted to the dots using the least square regression command Im from the R-project 
http://www.r-project.org/. This delivered R 2 = 0.997 for the running time and R 2 = 0.998 for the space peak. 
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Table 1 Comparison of Readjoiner and Edena 





RJ 


Edena 


Edena 
RJ 


RJ 


Edena 


Edena 
RJ 


RJ 


Edena 


Edena 
RJ 


Read set 


c22 


c22 


_ 


c15 


c15 




c7 


c7 


_ 


Genome size (Mbp) 


34.9 


34.9 


- 


81.7 


81.7 


- 


155.4 


155.4 


- 


Number of reads (M) 


7.0 


7.0 


- 


16.3 


16.3 


- 


31.1 


31.1 


- 


Contained reads (K) 


686.4 


6864 


- 


1665.7 


1665.7 


- 


3103.0 


3103.0 


- 


Irreducible SPM (M) 


7.2 


7.2 


- 


17.2 


17.2 


- 


36.4 


36.4 


- 


Overall time (s) 


360 


4903 


13.62X 


945 


13609 


14.40X 


2035 


29404 


14.45X 


Overall space (MB) 


294 


2753 


9.35X 


703 


6415 


9.1 3x 


1331 


12255 


9.21 X 


Contigs 


120712 


1 20462 




254830 


254111 




503446 


502706 




Total contigs length (Mbp) 


45.7 


44.7 




103.0 


101.1 




198.8 


195.0 




Assembly N50 (Kbp) 


1.6 


1.7 




2.4 


2.5 




2.3 


2.4 




Assembly NG50 (Kbp) 


2.7 


2.7 




3.7 


3.7 




3.9 


3.9 




Longest contig (Kbp) 


41.4 


414 




54.2 


54.2 




44.9 


44.9 





Results of applying Readjoiner (RJ) and Edena to the datasets c22, c15, c7 (£ mi „ = 45). 



c22, cl5, c7 and c2. Readjoiner was 19x to 20x faster than 
SGA and used 14% to 23% less space than SGA. Using 
overlap and assemble instead of fm-merge, SGA became 
slightly faster but required about 7 times more memory 
(data not shown). By default, SGA is less stringent than 
Readjoiner when computing the contigs from the string 
graph, thus producing more contigs with a lower N50 
value. For each of the four datasets, the longest contig 
produced by SGA and Readjoiner is identical, and the 
NG50 value of the assembly is comparable. 

LEAP implements the methods described in [13] to 
construct a full overlap graph. The efficiency of LEAP is 
remarkable, and allowed us to extend the comparison 
with Readjoiner to the hg20x dataset. Table 3 reports the 
results when applying Readjoiner and LEAP to the data- 
sets c22, c2 and hg20x. Additionally, it shows the results 
for Readjoiner on hg30x and hg40x (for which LEAP 
was not able to complete the overlap phase on our test 

Table 2 Comparison of Readjoiner and SGA 



machine with 64 GB RAM). Readjoiner was faster than 
LEAP for all datasets with a speedup factor of 1.6 to 1.8. 
Furthermore, it required less memory: While the reduc- 
tion in the space peak was at a maximum for the small 
datasets (c22, 2.99x), it is still significant (1.63 x ) for the 
hg20x dataset which contains almost two orders of mag- 
nitude more reads. In approximately the same time in 
which LEAP assembles hg20x , and using less memory, 
Readjoiner was able to assemble the hg30x dataset. 
Readjoiner was also able to assemble the hg40x dataset 
in 51 hours using 52 GB RAM. 

Evaluation of assemblies 

In order to assess the quality of the assemblies delivered 
by the different programs, we used the script assess_as- 
sembly.pl of the Plantagora project [26]. The script 
aligns the contigs to the template sequence from which 
the reads were sampled, using the Nucmer alignment 





RJ 


SGA 


SGA 
RJ 


RJ 


SGA 


SGA 
RJ 


RJ 


SGA 


SGA 
RJ 


RJ 


SGA 


SGA 
RJ 


Read set 


c22 


c22 




c15 


c15 




c7 


c7 




c2 


c2 




Genome size (Mbp) 


34.9 


34.9 




81.7 


81.7 




155.4 


155.4 




238.2 


238.2 




Number of reads (M) 


7.0 


7.0 




16.3 


16.3 




31.1 


31.1 




47.6 


47.6 




Sga index -d (K) 




300 






700 






1350 






2300 




Overall time (s) 


360 


7508 


20.86X 


945 


19334 


20.46X 


2035 


39988 


19.65X 


3185 


65194 


2047X 


Overall space (MB) 


294 


383 


1.30X 


703 


842 


1.20X 


1331 


1568 


1.1 8x 


2094 


2436 


1.1 6x 


Contigs 


120712 


231594 




254830 


547217 




503446 


1215816 




634403 


1702714 




Total contigs length (Mbp) 


45.7 


55.9 




103.0 


130.5 




198.8 


266.4 




292.2 


396.1 




Assembly N50 (Kbp) 


1.6 


0.8 




2.4 


1.0 




2.3 


0.5 




3.2 


1.2 




Assembly NG50 (Kbp) 


2.7 


2.7 




3.7 


3.7 




3.9 


3.9 




4.5 


4.5 




Longest contig (Kbp) 


41.4 


41.4 




54.2 


54.2 




44.9 


44.9 




52.9 


52.9 





Results of applying Readjoiner (RJ) and SGA to the datasets c22, c15, c7, c2 (£„,,-„ = 45). 
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Table 3 Comparison of Readjoiner and LEAP 





RJ 


LEAP 


LEAP RJ 
RJ 


LEAP 


LEAP RJ 
RJ 


LEAP 


LEAP 
RJ 


RJ 


RJ 


Read set 


c22 


c22 


c2 


c2 


hg20x 


hg20x 


- 


hg30x 


hg40x 


Genome size (Mbp) 


34.9 


34.9 


238.2 


238.2 


2861.3 


2861.3 


- 


2861.3 


2861.3 


Number of reads (M) 


7.0 


7.0 


47.6 


47.6 


579.5 


579.5 


- 


869.2 


1155.3 


Overall time 


6 min 


9 min 


1.60x 53 min 


1 h 36 min 


1 .81 x 20 h 4 min 


35 h 56 min 


1.79X 


34 h 9 min 


51 h 16 min 


Overall space (GB) 


0.3 


0.9 


2.99X 2.0 


4.0 


1.98X 27.9 


45.6 


1.63X 


39.8 


52.0 


Contigs 


120712 


1 1 3428 


634403 


630408 


3239309 


11662607 




1 3497497 


16253905 


Total contigs length (Mbp) 


45.7 


43.1 


292.2 


280.6 


2833.1 


3642.7 




4003.9 


4281.1 


Assembly N50 (Kbp) 


1.6 


1.6 


3.2 


3.0 


3.0 


1.4 




1.2 


0.9 


Assembly NG50 (Kbp) 


2.7 


2.4 


4.5 


3.9 


3.0 


2.5 




2.9 


2.8 


Longest contig (Kbp) 


41.4 


39.4 


52.9 


48.9 


63.4 


58.6 




63.4 


63.4 



Results of applying Readjoiner (RJ) and LEAP to the datasets c22, c2, hg20x , hg30x , hg40x (l ml „ = 45). LEAP was not able to process hg30x and hg40x on the test 
machine with 64 GB RAM. 



tool [27]. Several metrics, including the number of un- 
aligned contigs, misassemblies, SNPs and gaps are reported. 

Furthermore, assemblies were evaluated using the 
basic Assemblathon 1 statistics as defined in [28], in- 
cluding total length of the contigs, length of the longest 
and shortest contig, N50, L50, NG50 and LG50. Table 4 
reports the results of the evaluation of the assemblies of 
dataset c22. 

Effect of sequencing errors 

Readjoiner is based on the computation of exact suffix- 
prefix matches. Real-world datasets, however, contain a 
certain amount of errors. To better assess the effect of 
sequencing errors on the assembly, we sampled two sets 
of reads from the Escherichia coli K-12 genome, each 
consisting of 2 million reads: The first one (Ecoli-without- 
errors) is error-free, while in the second read set (Ecoli- 
with-errors) sequencing errors were introduced using a 
0.75% position-independent substitution rate. 

In order to assess the efficiency of /c-mer based error 
correction (which we plan to implement in Readjoiner), 
we pre-processed Ecoli-with-errors using SGA. This con- 
structs an FM-index from the reads set and errors are 
corrected using sga correct, delivering the read set 
termed Ecoli-SGA-corrected. This is further processed by 
sga filter, after constructing a new FM-index, to elimin- 
ate further error-containing reads. This resulting set 
Ecoli-SGA-corrected+filtered was assembled using both 
Readjoiner and SGA. Table 5 gives the most important 
statistics of the Readjoiner and SGA assemblies of the 
four different read sets defined here. 

Discussion and conclusion 

In this paper, we presented methods and implementation 
techniques of a new string graph based assembler, named 
Readjoiner, which is significantly faster or more space 



efficient than the previous software tools Edena [9], SGA 
[11] and LEAP [13]. In particular, Readjoiner can handle a 
set of reads with 40x coverage of the entire human gen- 
ome (total length of all reads 115 Gbp) on a machine with 
64 GB RAM. 

Although the different string graph-based assemblers 
aim at constructing the same graph, they apply different 
heuristics to compute a layout from the string graph. 
The quality of assemblies of simulated datasets was 
compared using metrics from the Plantagora project 
[26] and the Assemblathon 1 project [28]. In the assem- 
blies of c22 delivered by Readjoiner, SGA and Edena 
there are 4 misassembled contigs. In contrast, 53.4% of 
the contigs of the LEAP assembly could not be aligned 
to the reference and 4.3% of the aligned contigs were 
misassembled. The "Negative Gaps" metric computed by 
Plantagora reflects the amount of overlaps among the 
contigs. Its high value for all tools can be explained by 
the fact that branching nodes in the string graph start 
new contigs in which the read corresponding to the 
branching node is included. Additionally considering the 
"Positive Gaps" metrics, one can conclude that most 
contigs were interrupted due to the presence of repeti- 
tive sequences, but not due to low coverage. 

Our main development is a new efficient algorithm 
to compute all irreducible suffix-prefix matches from 
which the string graph is constructed. While the basic 
techniques we use (e.g. integer encodings, suffix sort- 
ing, integer sorting, binary search, bottom-up traversal 
of lcp-interval trees) are mostly well-established in se- 
quence processing, their combination is novel for the 
considered problem. The different techniques were chosen 
with the overall goal of performing as few as possible ran- 
dom accesses to large data structures to obtain algorithms 
with excellent data locality which in turn leads to 
short run times. For most parts of our method, this 
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Table 4 Metrics of assemblies of the c22 dataset 


Assemblathon metrics 


RJ 


SGA 


Edena 


LEAP 


Number of contigs 


120712 


231594 


1 20462 


1 1 3428 


Genome size (bp) 


34894545 


34894545 


34894545 


34894545 


Total contigs length 


45667531 


55880641 


44737441 


43099113 


- as % of genome 


130.87 


160.14 


128.21 


123.51 


Mean contig size 


378.32 


241.29 


371.38 


379.97 


Median contig size 


132 


101 


120 


117 


Longest contig 


41352 


41352 


41352 


39379 


Shortest contig 


102 


100 


100 


101 


Contigs > 500 bp 


13467 (11.16%) 


1 341 6 (5.79%) 


13439 (11.16%) 


13430 (11.84%) 


Contigs > 1 Kbp 


8700 (7.21%) 


8684 (3.75%) 


8696 (7.22%) 


8578 (7.56%) 


Contigs > 10 Kbp 


264 (0.22%) 


264 (0.11%) 


264 (0.22%) 


228 (0.20%) 


N50 


1614 


815 


1699 


1617 


L50 


5684 


10118 


5416 


5488 


NG50 


2737 


2739 


2733 


2461 


LG50 


3120 


3113 


3121 


3429 


Plantagora metrics 


RJ 


SGA 


Edena 


LEAP 


Covered Bases 


34343945 


34357693 


34300114 


12968118 


Ambiguous Bases 


159997 


154584 


182952 


696334 


Misassemblies 


4 


4 


4 


3693 


Misassembled Contigs 


4 


4 


4 


2344 


Misassembled Contig Bases 


1283 


---11/ 


1245 


2797710 


SNPs 


104 


125 


120 


46270 


nsertions 


5 


2 


1 


2403 


Deletions 


43 


23 


28 


5187 


Positive Gaps 


2679 


2471 


2925 


26495 


Internal Gaps 


0 


0 


0 


21 


External Gaps 


2679 


2471 


2925 


26474 


- total length 


547408 


558921 


589979 


19064103 


- average length 


204 


226 


202 


720 


Negative Gaps 


110888 


218908 


110811 


18198 


Internal Overlaps 


0 


0 


0 


17 


External Overlaps 


110888 


218908 


110811 


18181 


- total length 


-10247647 


-20078971 


-9424823 


-1859835 


- average length 


-92 


-92 


-85 


-102 


Redundant Contigs 


864 


1158 


607 


6329 


Unaligned Contigs 


3262 


4686 


3221 


60563 


- partial 


18 


57 


21 


3252 


- total length 


462668 


599320 


447922 


27666823 


Ambiguous Contigs 


2631 


3876 


2619 


799 


- total length 


369284 


483895 


366418 


93102 



Metrics of the assemblies of the dataset c22 as delivered by Readjoiner (RJ), SGA, Edena and LEAP (£ m ,„=45). The metrics are explained in [26] and in [28]. 
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Table 5 Assembly of error-containing reads 





NSO (bp) 




NG50 (bp) 






RJ 


SGA 


RJ 


SGA 


Ecoli-without-errors 


54948 


54936 


57213 


57210 


Ecoli-with-errors 


203 


5110 


245 


8645 


Ecoli-SGA-corrected 


38178 


40002 


39999 


40824 


Ecoli-SGA-corrected+filtered 


41872 


41872 


41905 


41903 



N50 and NG50 values for the Readjoiner and SGA assemblies of Ecoli-without- 
errors, Ecoli-with-errors, Ecoli-SGA-corrected and Ecoli-SGA-corrected+filtered. 



goal was achieved, mostly due to the partitioning of 
the set of SPM-relevant suffixes. There are still many 
random accesses to the representation of the reads, 
which, however, cannot fully be prevented in an index 
based approach. 

The problem of computing suffix-prefix matches has 
long been studied in the literature, mostly with the goal 
of finding, for each pair of reads r and s, the longest 
suffix-prefix match of r and s. Gusfield et al. [29] solved 
this maximum suffix-prefix matching problem in opti- 
mal 0{n + m 2 ) time and optimal 0(n) space using the 
suffix tree for all suffixes of m reads of total length n. 

Ohlebusch and Gog [30] present a solution to the same 
problem with the same time and space complexity using a 
linear scan of an enhanced suffix array. We do not know 
of any solution of the maximum suffix-prefix match prob- 
lem which appropriately handles the reverse complements 
of the reads. Applying the algorithms of [29] or of [30] to 
the set of all reads and their reverse complements would 
not guarantee the maximality constraint, as the forward 
and reverse complement of a read are represented in dif- 
ferent locations of the employed index structure. 

In Edena, suffix-prefix matches are computed using a 
suffix array. Details of the algorithm or the implementa- 
tion are not published. 

Like Simpson and Durbin [11], we replaced the maxim- 
ality constraint by a minimum length constraint imposed 
on each suffix-prefix match. The modified problem allows 
for an algorithm with two important advantages (com- 
pared to the algorithms of [29,30]): At first, the algorithm 
does not require a stack for each of m reads, and still can 
employ the space and time efficient bottom-up traversal 
of an lcp-interval tree as presented in Algorithm 1. More- 
over, the algorithm can easily handle reverse complements 
of the reads and efficient selection of irreducible suffix- 
prefix matches is possible. 

There are two main approaches to the construction of 
a string graph. The original approach of Myers [8] was 
to first construct a full overlap graph before transitive 
edges are removed. The resulting string graph contains 
all information relevant for the layout of the contigs. As 
the string graph contains much less edges than the over- 
lap graph (the ratio depends on the coverage of the read 



set, see [8]), the explicit representation of this usually 
defines the space peak. 

An alternative overlap graph representation for exact 
suffix-prefix matches was introduced in [13] and imple- 
mented in the LEAP software. The basic idea of this ap- 
proach is to implicitly store many suffix-prefix matches 
for a set of lexicographically related reads in constant 
space using an interval representation. This allows for a 
compact storage of the full overlap graph. The represen- 
tation does not apply to irreducible SPMs. In [13] only 
asymptotic results regarding the space requirement of 
the compact overlap graph representation are given, and 
LEAP does not give any clues on the size of the graph it 
constructs. So it remains unclear if this representation of 
the overlap graph is smaller than our representation of 
the string graph. A comparison of the overall space re- 
quirement of LEAP and Readjoiner shows a clear advan- 
tage for Readjoiner, see Table 3 for details. 

It is worthwhile to note that the contigs output by 
LEAP contain many differences with respect to the tar- 
get sequences they were sampled from. It is not clear to 
us, whether this is an artifact of the method or an imple- 
mentation issue. 

Another efficient way to reduce the space peak for 
string graph constructions is to recognize transitive 
SPMs and prevent their insertion in the graph structure. 
Simpson and Durbin [11] developed the first method 
following this approach and implemented it in the SGA 
software. In this paper, we have described an alternative 
algorithm, exploiting a property of transitive SPMs that 
can easily be checked on a small set of strings. 

Our comparative tests (Table 2) indicate that Readjoiner 
is more than one order of magnitude faster than the 
current SGA implementation and uses less space. This 
may come as a surprise as SGA uses a compact index 
structure based on the BWT, while Readjoiner employs 
techniques known from enhanced suffix arrays, which are 
usually more space consuming. The space advantage of 
Readjoiner is mainly a result of our partitioning approach 
applied to the array of SPM-relevant suffixes. The parti- 
tioning technique leads to a large reduction in the overall 
memory peak and a small increase in the running time. 
This can be explained by an improved cache coherence: 
For a given part, only a small portion of the different 
tables are accessed. This seems to outweigh the time for 
the additional passes over the reads. 

We see two reasons for the time advantage of Readjoiner: 
at first it employs a suffix selection and sorting method 
which is specifically tailored for the suffix-prefix matching 
problem and the given minimum match length t mlK . In 
contrast, the BWT employed by SGA provides a general 
string indexing technique that is not optimized for com- 
puting SPMs of an arbitrary but fixed minimum length. 
Secondly, Readjoiner computes suffix-prefix matches by a 
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linear scan of two integer tables, which is a very fast oper- 
ation. In contrast, SGA relies on random accesses to the 
BWT which may take longer for large data sets. 

The minimum match length parameter i min is used to 
restrict the search to the exact SPMs that are considered to 
be significant. To balance the required computational 
resources and the quality of the assembly, one has to care- 
fully choose an appropriate value for i min = A5. A larger 
value of t min reduces the number of SPM-relevant suffixes, 
and in turn speeds up the computation and reduces the 
space requirement, but may lead to a poor assembly. Inter- 
estingly, in our simulations based on reads of length 
100 bp, we obtained the best assembly results for a rela- 
tively large value of £ m ,„ around 65. However, for a fair 
benchmarking of the tools and to simplify comparison with 
previous publications, we have chosen l min = 45. 

Among the string graph-based assemblers mentioned 
here, SGA is the only one that can distribute parts of the 
computation across multiple threads. Some of the algo- 
rithms employed in Readjoiner are well suited for a multi- 
threaded implementation. For example, each bucket of 
SPM-relevant suffixes is sorted independently and the cor- 
responding SPMs are computed independendy of all other 
buckets. This step only requires random read access to 
the representation of the reads. A multi-threaded imple- 
mentation with shared memory access to the reads and 
buckets which are (with respect to their sizes) evenly dis- 
tributed over the threads, is expected to provide a consid- 
erable speedup within a small amount of additional space. 

Another important issue for future development is the 
improvement of the assembly quality for real world data. 
Here further preprocessing steps, in particular quality filter- 
ing and error detection are required, as well as the handling 
of paired read information in the assembly phase. 

The present manuscript focuses on the algorithmic ap- 
proach and implementation of methods for the compu- 
tation of irreducible suffix-prefix matches and the 
construction of the string graph. We report our results 
on error-free datasets: This is in analogy to the first 
papers describing the methods implemented in SGA 
[11] and LEAP [13]. 

Several error correction strategies have been applied so 
far: The classical method was to consider approximative 
suffix-prefix matches of the reads and to correct the result- 
ing contigs in a consensus phase. With large next- 
generation datasets, the method of choice consist in /c-mer 
counting, identification of a subset of trusted /c-mer, which 
occur at least a given number of times in the read set, and 
correction of the reads containing untrusted k-mers [12,31]. 

Approximative suffix-prefix matching algorithms can be 
implemented to work on index structures, but the 
increased search space makes them significantly slower 
than exact matching algorithms. Among the string graph- 
based assemblers, only SGA implements an approximate 



suffix-prefix matching algorithm: Nevertheless, this is not 
used by default, and the authors recommend using their 
faster /c-mer based error correction method instead [12]. 

The fact that Readjoiner is based on exact suffix-prefix 
matches makes it sensible to errors. We have demonstrated 
that using a /c-mer based error correction step delivers read 
sets for which Readjoiner delivers assemblies with metrics 
comparable to SGA. We therefore plan to implement a 
A:-mer based error correction for Readjoiner, employ- 
ing techniques similar to those used for computing 
suffix-prefix matches. 

Paired-end and mate pairs provide short and long 
range positional information, which is critical for im- 
proving the quality of assembling eukaryotic genomes. 
The classical approach consists in using this informa- 
tion for connecting contigs into scaffolds either in a 
post-processing phase, which may be integrated in the 
assembler software, or using a stand-alone tool, such as 
Bambus [32] or SSPACE [33]. A complementary ap- 
proach, which we intend to introduce in a future ver- 
sion of our software, is to exploit the pairing 
information already during the traversal of the string 
graph, by restricting to paths connecting the mate pairs 
with a length compatible to the insert size of the library. 
Details of such an approach are given in [34,35]. 

Availability 

The Readjoiner software is available as part of the 
GenomeTools genome analysis package [21], a free, open 
source collection of bioinformatics software. See http:// 
www.zbh.uni-hamburg.de/readjoiner for more details. 

Additional file 



Additional file 1: Supplemental Material. This document describes 
implementation techniques for the methods and algorithms described in 
the main document. Moreover, it gives a lemma and a theorem 
(including proofs) characterizing transitive SPMs, and an algorithm to 
enumerate irreducible and non-redundant suffix-prefix matches. 
Furthermore, a method to recognize internally contained reads is given, 
as well as results for a benchmark set with reads of variable length. 
Finally, an example of SPM-relevant suffixes and their corresponding 
Icp-interval is presented. 
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